Non-record: Stable Growing Recurrence, Progressive Depth + Error Feedback (1.1163 BPB) by nestamidavaine · Pull Request #1231 · openai/parameter-golf

nestamidavaine · 2026-04-01T20:29:59Z

Recurrent Depth with Progressive Pass Growth + Error Feedback

val_bpb: 1.1163 (3-seed mean, std 0.0013) | ~15.96 MB | 8×H100 SXM

A non-record submission targeting significant improvement over PR #549 (LeakyReLU² baseline, 1.1194 mean bpb). Achieves -0.0031 bpb vs that baseline. For an in-depth analysis of depth recurrence in this competition, see PR #363. I targeted 549 when I started building this solution, after I finished evaluation the new improved model has been published to the leaderboard. However I believe the experiments here can be applied to any model to improve performance, with the largest benefit for submissions using TTT since the recurrance makes use of the 10 available minutes of evaluation time very effectively.

Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

Seed	step_avg	steps	Pre-TTT bpb	Post-TTT bpb	TTT gain	TTT time	Artifact
1337	83.5ms	6,328	1.1353	1.1157	-0.0196	566s	15,909,018
42	83.5ms	6,334	1.1372	1.1177	-0.0195	579s	15,897,530
2025	83.4ms	6,334	1.1351	1.1155	-0.0197	588s	15,995,558
Mean	83.5ms	6,332	1.1359	1.1163 (std 0.0013)	-0.0196	~578s

We significantly beat the PR #549 LeakyReLU² baseline (1.1194 mean bpb / 1.8901 nats) by -0.0031 bpb / -0.0053 nats across all three seeds (1.1163 mean bpb / 1.8848 nats), achieving the goal we set out with.

Progressive Recurrence Architecture

   ┌───────────┐               ┌───────────┐                 ┌───────────┐
   │           │               │           │                 │           │
   │   Tail    │               │   Tail    │                 │   Tail    │
   │  [7-10]   │               │  [7-10]   │                 │  [7-10]   │
   │           │               │           │                 │           │
   ├───────────┤               ├───────────┤╮                ├───────────┤╮
   │           │  4500 steps   │           ││   1000 steps   │           ││
   │   Core    │ ───────────>  │   Core    ││  ──────────>   │   Core    ││
   │  [4-6]    │               │  [4-6]    │2x               │  [4-6]    │3x
   │           │               │           ││                │           ││
   ├───────────┤               ├───────────┤╯                ├───────────┤╯
   │           │               │           │                 │           │
   │   Stem    │               │   Stem    │                 │   Stem    │
   │  [0-3]    │               │  [0-3]    │                 │  [0-3]    │
   │           │               │           │                 │           │
   └───────────┘               └───────────┘                 └───────────┘

   11 layers                   14 layers                     17 layers
   (steps 0-4499)              (steps 4500-5499)             (steps 5500+, eval)

The Problem: Depth Recurrence Fails Under Competition Constraints

PR #363 demonstrated that depth recurrence — reusing a shared block of transformer layers multiple times — saves parameters but hurts bpb under the 10-minute / 16MB competition constraints. Their controlled experiments showed a +0.025 bpb gap (looped worse) due to two compounding taxes:

Quantization error amplification. When shared weights are quantized to int6, the quantization error is injected at every pass. After K passes through the same core, the cumulative error grows superlinearly. Additionally hidden state magnitudes tend to explode with to many recurrent passes through a block if we do not stabilize this.
Step time overhead. Each additional recurrence pass adds forward/backward compute. With 4 passes, +32ms/step translates to ~1200 fewer training steps in the 600s budget.

Our Solution: Late Growth + Contractive Stabilization

We address both taxes by growing recurrence depth progressively during training and stabilizing the recurrent dynamics.

Progressive Pass Schedule (Late Growth)

The key insight: start training with 1 pass and gradually add passes late in training. This preserves fast step times for the majority of training (83.5ms/step at 1-pass vs ~95ms at 3-pass), maximizing the total number of gradient updates within the 600s wallclock budget. The schedule:

Step range	Passes	Effective layers	step_avg
0–4499	1	11	~83.5ms
4500–5499	2	14	~85.5ms
5500–6328	3	17	~91ms

This reduces the step/capacity trade-off that normally makes recurrence impractical under competition constraints. We get ~6,330 training steps (vs ~7,180 for the flat LeakyReLU baseline), but the final model has 17 effective layers at eval vs the baseline's 11.

We also tested training with 4 recurrence passes. While 4-pass shows better per-step loss, the additional step time cost (~105ms/step) means fewer total steps within the wallclock budget. Under the competition's 600s constraint, 3-pass wins the step/capacity trade-off, the extra training steps from the faster 3-pass schedule outweigh the marginal per-step quality gain from 4 passes.

Learnable Residual Scaling

Per-pass learnable scalars contract the residual update, preventing hidden state magnitude growth across passes:

$$h_{k+1} = h_k + \alpha_k \cdot F(h_k + c_k)$$

where $\alpha_k$ is initialized to 0.5 and learned during training. This ensures the recurrent dynamics are contractive — later passes refine rather than amplify.

Error Feedback Module

A low-rank correction compensates for accumulated error before each recurrence pass:

$$e_k = U(V^\top h_k), \qquad c_k = \mathrm{diag}(d) \cdot e_k$$

where $U, V \in \mathbb{R}^{d \times r}$ with rank $r=2$ and $d \in \mathbb{R}^d$ is a learnable diagonal. The correction is zero on pass 0 (no prior error to correct) and active on subsequent passes. Total parameter overhead: 2,560 params (negligible vs 26.7M model params).

The feedback module is important but not strictly required — we confirmed that stable training is possible without it, and even running eval-only without feedback works, at a cost of ~0.001 bpb higher. The feedback module's main contribution is providing the recurrent passes with an error signal about the previous iteration's residual.

Jacobian Proxy Loss (Stabilizer)

A regularization term penalizes hidden state growth ratio above 1.0, enforcing contractive dynamics without computing the full Jacobian:

$$L_J = \lambda \cdot \text{ReLU}!\left(\frac{\lVert h_{k+1} - h_k \rVert}{\lVert h_k \rVert + \epsilon} - 1\right)^2$$

with $\lambda = 0.01$. This is a cheap finite-difference proxy for the spectral norm of the Jacobian $\partial h_{k+1}/\partial h_k$, encouraging it to stay below 1 (contractive map). The model learns to adhere to this quickly and it does not seem to effect early training dynamics. However we did see better results with 0.01 compared to 0.1 for Lambda, potentially since the restriction of 0.1 is to high, we don't always need contractive layers with only 3x recurrance, but we do need it to not explode.

This loss term is critical for training stability. Without it, gradient norms and hidden state magnitudes explode during the multi-pass phases, destabilizing training. The proxy loss keeps the recurrent dynamics well-behaved without the computational cost of full Jacobian computation.

Note: the jacobian proxy loss is only added to the training loss — it does not affect evaluation scoring, which uses pure cross-entropy.

Legal TTT Protocol

Score-first legal TTT following PR #461:

Val tokens split into 1,893 non-overlapping 32K-token chunks. Here 3 pass recurrance is vital since with 4 passes we must increase chunk size to fit within the time limit.
For each chunk:

SCORE: Sliding window eval under torch.inference_mode() — no gradients, no weight mutation
TRAIN: SGD on the already-scored chunk. 3 epochs, all blocks unfrozen, cosine LR decay, grad clip 1.0

Last chunk scored but never trained on

Parameter	Value
Chunk size	32,768 tokens
Optimizer	SGD + momentum(0.9)
Learning rate	0.002 (cosine decay)
Epochs per chunk	3
Frozen blocks	None (all blocks adapt)
Gradient clip	1.0
Eval passes	3 (matching final training phase)

Timing Budget

Phase	Time
Training (wallclock cap)	600s (10 min)
Standard eval (int6 + sliding window)	~3s
Legal TTT (score-first + adaptation)	~578s
Total eval	~581s (< 10 min)

Architecture

Built on the PR #414 stack with PR #399 Parallel Muon:

Component	Setting
Layers	11 unique (512d, 8H, 4KV)
Effective layers (eval)	17 (4 stem + 3 core ×3 + 4 tail)
MLP	3× with LeakyReLU(0.5)²
BigramHash	512
XSA	Last 4 layers
RoPE	Partial (16/64 dims)
LN Scale	1/√(layer+1)
VE128	Layers 9-10
Recurrence core	Layers 4-6, progressive 1→2→3 passes
ResidualScale	Per-pass learnable, init 0.5
Error Feedback	Diagonal mode, rank 2, 2560 params
Jacobian proxy	λ=0.01
Weight avg	EMA(0.997) + SWA(every 50)
Quantization	Late QAT (threshold 0.15) + GPTQ-lite int6 + lzma
Warmup precompilation	All pass×QAT graph variants compiled during 20 warmup steps
Optimizer	Parameter Banking + Parallel Muon

Run Command

cd records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance
bash run_earlyqat.sh  # Single seed (set SEED env var)

Key flags:

torchrun --standalone --nproc_per_node=8 train_gpt.py \
    --feedback-mode diagonal --feedback-rank 2 \
    --residual-scale-init 0.5 \
    --jacobian-proxy-weight 0.01 \
    --no-interpass-rmsnorm

Tricks

Graph Precompilation Warmup

torch.compile is lazy — it only compiles a new graph variant the first time it's encountered. With progressive recurrence (1→2→3 passes) and late QAT, this means the training loop would hit compilation stalls at step 4500 (2-pass), step 5500 (3-pass), and again when QAT enables. Under a 600s wallclock cap, these stalls are expensive.

The fix: precompile all graph variants during warmup before training starts. During the 20 warmup steps:

The last few warmup steps cycle through each num_passes variant (2-pass, 3-pass) and each with QAT toggled on
This forces torch.compile to eagerly compile every forward/backward graph that will appear during training
After warmup, model weights and optimizer states are restored to their initial values — the warmup steps have zero effect on the actual training run

This ensures the training loop runs at full speed from step 0 with no compilation jitter when passes change or QAT kicks in.

Code Minification with python-minifier

The original training script was 88,253 bytes, which caused seed 2025 to exceed the 16MB submission limit (16,025,625 bytes). After removing dead code paths (eval-only mode, int8 quantization, unused feedback variants, verbose logging), the file was still too large.

python-minifier with --no-rename-locals shrinks the code aggressively (whitespace, docstrings, constant folding) while preserving local variable names — critical because the training script uses string-based lookups for state_dict keys and named_parameters. This brought the file from 68,435 bytes down to 58,186 bytes, comfortably fitting all seeds under the 16MB decimal limit.

Note: The code was minified after all three seed runs completed, so the log files report Code size: 88253 bytes and correspondingly larger Total submission size values. The actual submission uses the minified 58,186-byte script — the correct per-seed totals are listed in submission.json and the results table above.

Credits

Base model: PR #414 by @signalrush
Optimizer (Parameter Banking + Parallel Muon): PR #399 by @abaybektursun
LeakyReLU² activation: PR #493 by @parinzee
TTT recipe: PR #461 by @Christopher-Lee-McClendon
Depth recurrence analysis: PR #363 by @evangelinehelsinki

Made with Cursor

…cord) Progressive depth recurrence (1->2->3 passes) with diagonal error feedback and jacobian proxy stabilization. Late growth preserves fast step times for most of training, avoiding the step/capacity trade-off that makes naive recurrence impractical under competition constraints. 3-seed mean: 1.1163 bpb / 1.8848 nats (std 0.0013) Baseline PR openai#549: 1.1194 bpb / 1.8901 nats Improvement: -0.0031 bpb / -0.0053 nats 8xH100 SXM, 600s wallclock, all artifacts under 16MB decimal.

nestamidavaine changed the title ~~Stable Growing Recurrence: Progressive Depth + Error Feedback (non-record)~~ Non-record: Stable Growing Recurrence, Progressive Depth + Error Feedback Apr 1, 2026

nestamidavaine changed the title ~~Non-record: Stable Growing Recurrence, Progressive Depth + Error Feedback~~ Non-record: Stable Growing Recurrence, Progressive Depth + Error Feedback (1.1163 BPB) Apr 2, 2026

nestamidavaine mentioned this pull request Apr 2, 2026

Record: ParallelResiduals + MiniDepthRecurrence, 1.1063 BPB / 1.8679 nats, -0.0072 vs PR #1179, -0.0143 vs merged SOTA #1204

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Non-record: Stable Growing Recurrence, Progressive Depth + Error Feedback (1.1163 BPB)#1231

Non-record: Stable Growing Recurrence, Progressive Depth + Error Feedback (1.1163 BPB)#1231
nestamidavaine wants to merge 1 commit intoopenai:mainfrom
nestamidavaine:submission/stable-growing-recurrence

nestamidavaine commented Apr 1, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nestamidavaine commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Recurrent Depth with Progressive Pass Growth + Error Feedback

Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

Progressive Recurrence Architecture

The Problem: Depth Recurrence Fails Under Competition Constraints

Our Solution: Late Growth + Contractive Stabilization

Progressive Pass Schedule (Late Growth)

Learnable Residual Scaling

Error Feedback Module

Jacobian Proxy Loss (Stabilizer)

Legal TTT Protocol

Timing Budget

Architecture

Run Command

Tricks

Graph Precompilation Warmup

Code Minification with python-minifier

Credits

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

nestamidavaine commented Apr 1, 2026 •

edited

Loading